LAPOR! EDA by Septi Rito Tombe

Introduction

This data was fetched from http://data.go.id/dataset/data-aspirasi-dan-pengaduan-masyarakat. It contains reports submitted by Indonesian citizens through a platform called LAPOR!. The platform enables officials to resolve public issues posted by citizens and informs the citizens once an issue is solved. Originally, the data consisted of two CSV files. The first (data.csv) contains each report’s content and its attributes such as creator, area, and category. The second (daftararea.csv) contains the area details, including latitude and longitude. The two files were merged on “area”, as shown in the wrangle.R file. Some attributes were deleted because they were redundant with other attributes or might contain sensitive material, for example the report’s content.

The rationale behind this analysis is to explore how the platform is used. Some questions that may be answered through the analysis are:

1. When are reports most frequently posted?
2. What proportion of reports is solved?
3. Where do most reports originate?
4. Are there any interesting values regarding the idle time of each report?
5. etc… (I intend to create as many 5W + 1H questions as necessary while exploring the data)

This report contains four main sections. The first three sections contain univariate, bivariate, and multivariate data exploration. The last section summarises all the findings and presents a reflection.

Univariate Plots Section

## [1] 79565    12
## 'data.frame':    79565 obs. of  12 variables:
##  $ X                   : int  3 6 7 8 12 13 16 17 18 20 ...
##  $ area                : Factor w/ 2614 levels "32 Ilir","36 Ilir",..: 1 2 3 3 4 4 4 4 4 4 ...
##  $ latitude            : num  -3 -3.01 -2.95 -2.95 -4 ...
##  $ longitude           : num  105 105 105 105 103 ...
##  $ id                  : int  40295 70931 85523 86690 125 23088 47025 24064 51765 12281 ...
##  $ reporter            : Factor w/ 19638 levels " Andar djati prakoso,sh 0001808331974",..: 5056 5080 15924 15451 18974 7322 14828 7910 3620 7925 ...
##  $ category            : Factor w/ 95 levels "","Administrasi Kependudukan",..: 7 7 23 52 88 9 88 9 70 9 ...
##  $ related_department  : Factor w/ 381 levels "Aceh - Kanim Kelas II Lhokseumawe",..: 380 297 67 130 298 380 295 380 319 380 ...
##  $ status              : Factor w/ 3 levels "Belum","Proses",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ report_issued       : int  1387261022 1423534297 1445360400 1447299913 1350285493 1375017450 1393683374 1375526630 1400039139 1373075872 ...
##  $ report_last_activity: int  1414723890 1429862802 1446449675 1447419098 1353390928 1415349499 1429167543 1413862028 1402627972 1406019615 ...
##  $ report_closed       : int  1415815214 1431021603 1447351205 1448474403 1354557601 1416420015 1430244004 1414951210 1428343204 1407693606 ...
##        X                                     area          latitude      
##  Min.   :     3   Bandung                      :10416   Min.   :-10.725  
##  1st Qu.: 30729   Kota Bandung                 : 5875   1st Qu.: -7.204  
##  Median : 73174   Daerah Khusus Ibukota Jakarta: 3617   Median : -6.644  
##  Mean   : 75359   Jawa Tengah                  : 2428   Mean   : -5.769  
##  3rd Qu.:127446   Tamansari                    : 2059   3rd Qu.: -6.145  
##  Max.   :160944   Grogol                       : 1840   Max.   :  5.836  
##                   (Other)                      :53330                    
##    longitude           id                  reporter    
##  Min.   : 95.3   Min.   :    1   Anonim        : 8283  
##  1st Qu.:106.8   1st Qu.:36696   62817911xxxx  :  616  
##  Median :107.6   Median :60368   628787550xxxx :  580  
##  Mean   :108.6   Mean   :53768   628180203xxxx :  304  
##  3rd Qu.:110.8   3rd Qu.:74808   628778071xxxx :  273  
##  Max.   :140.7   Max.   :87591   628133191xxxx :  270  
##                                  (Other)       :69239  
##                                 category    
##  Topik Lainnya                      :15085  
##  Infrastruktur                      : 9913  
##  Reformasi Birokrasi dan Tata Kelola: 9567  
##  Kartu Indonesia Pintar (KIP)       : 4566  
##  Beras Miskin (Raskin)              : 4199  
##  Pendidikan                         : 4154  
##  (Other)                            :32081  
##                                          related_department
##  Tim Sosialisasi KKS                              : 5792   
##  Tim Sosialisasi Kebijakan Penyesuaian Subsidi BBM: 5636   
##  Dinas Pekerjaan Umum                             : 4823   
##  Dinas Perhubungan                                : 3726   
##  Badan Penyelenggara Jaminan Sosial Kesehatan     : 2738   
##  Dinas Perhubungan (Dishub) Kota Bandung          : 2408   
##  (Other)                                          :54442   
##      status      report_issued       report_last_activity
##  Belum  :   34   Min.   :1.338e+09   Min.   :1.339e+09   
##  Proses : 1979   1st Qu.:1.384e+09   1st Qu.:1.390e+09   
##  Selesai:77552   Median :1.416e+09   Median :1.418e+09   
##                  Mean   :1.407e+09   Mean   :1.412e+09   
##                  3rd Qu.:1.428e+09   3rd Qu.:1.431e+09   
##                  Max.   :1.449e+09   Max.   :1.449e+09   
##                                                          
##  report_closed      
##  Min.   :1.341e+09  
##  1st Qu.:1.391e+09  
##  Median :1.419e+09  
##  Mean   :1.413e+09  
##  3rd Qu.:1.432e+09  
##  Max.   :1.449e+09  
## 

Below is an overview of the dataset.

As shown in the chart, most reports are finished (Selesai) and less than 1% are still waiting to be processed. I wonder when these reports were issued.
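The status proportions can be checked directly from the counts in the summary output. This is a minimal sketch in base R; the vector name `status_counts` is my own, and the English glosses in the comment (not yet / in process / finished) are my reading of the status labels.

```r
# Status counts copied from the summary() output above.
# Belum = not yet processed, Proses = in process, Selesai = finished.
status_counts <- c(Belum = 34, Proses = 1979, Selesai = 77552)

# Share of each status in percent, rounded to two decimals.
status_pct <- round(100 * prop.table(status_counts), 2)
print(status_pct)
```

The same call works on `table(reports$status)` if the data frame is available, where `reports` is a hypothetical name for the merged data.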

## Warning in format.POSIXlt(as.POSIXlt(x), ...): unknown timezone 'zone/tz/
## 2017c.1.0/zoneinfo/Australia/Melbourne'

As shown in the histograms above, there are both similarities and differences between the reports overall and the reports still waiting. The similarities appear on a monthly and daily basis: most reports are issued in January and the fewest on weekends. On the other hand, on a yearly basis the two graphs contradict each other. This makes me wonder how long it typically takes for an issue to be solved. I will cover this in the Bivariate section.
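Grouping by year, month, and weekday requires turning the epoch seconds in report_issued into calendar parts. The sketch below shows one way to do that in base R. The time zone "Asia/Jakarta" and the column names `year`, `month`, and `wday` are assumptions for illustration, not necessarily what the original analysis used.

```r
# Two report_issued values taken from the str() output (Unix epoch seconds).
issued <- c(1387261022, 1423534297)

# Convert to date-time; "Asia/Jakarta" (UTC+7) is an assumed time zone.
issued_time <- as.POSIXct(issued, origin = "1970-01-01", tz = "Asia/Jakarta")

# Derive grouping columns (names are illustrative only).
parts <- data.frame(
  year  = as.integer(format(issued_time, "%Y")),
  month = as.integer(format(issued_time, "%m")),
  wday  = as.POSIXlt(issued_time)$wday  # 0 = Sunday ... 6 = Saturday
)
print(parts)
```

Using the numeric `wday` rather than `weekdays()` keeps the result independent of the system locale.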

Furthermore, let's check the other time variables, such as the last activity. I wonder how long a report is typically idle or in process (report_last_activity != report_closed).

## Warning: Removed 815 rows containing non-finite values (stat_bin).

The histogram is highly skewed, showing that the waiting/processing time is relatively short. However, the long tail indicates outliers to be investigated. Furthermore, there is something interesting in the daily waiting time: the counts increase again towards 500 days after declining before that.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    4.346   13.758   54.918   41.136 1249.927

Next, I want to investigate report_closed to see when most reports are finished.

As shown, the daily histogram has an unusual distribution: the lowest activity occurs on Friday and Saturday, whereas the weekend in Indonesia is generally Saturday and Sunday. The monthly data shows that most reports are closed in March and November.

Univariate Analysis

What is the structure of your dataset?

The data consists of 79565 reports from 2012-2015. Originally, the data consisted of 11 features; eventually, it was expanded to 17 features.

What is/are the main feature(s) of interest in your dataset?

The main features of interest are the status, area, idle_time, and the time variables (report_issued, report_last_activity, and report_closed). I intend to investigate which areas have the best response time and finish rate. Furthermore, I also want to investigate the same for related_department.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Additional features such as the duration between issue time and close time would help answer the questions. Moreover, with latitude and longitude I can visualise the data more comprehensively on a map.

Did you create any new variables from existing variables in the dataset?

Yes, I created a count of each status to find the proportions. I also created some new features from the time data. For example, report_issued.month is used to group the data on a monthly basis. Another example is idle_time, which is the duration between report_issued and report_last_activity.
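A minimal sketch of the idle_time derivation described above, assuming the duration is expressed in days and computed directly from the epoch-second columns. The sample values are the first three rows of the str() output; the exact code in the original analysis may differ.

```r
secs_per_day <- 86400

# First three rows from the str() output (Unix epoch seconds).
report_issued        <- c(1387261022, 1423534297, 1445360400)
report_last_activity <- c(1414723890, 1429862802, 1446449675)

# Idle time in days: last recorded activity minus issue time.
idle_time.day <- (report_last_activity - report_issued) / secs_per_day
print(round(idle_time.day, 1))
```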

Of the features you investigated, were there any unusual distributions?  Did you perform any operations on the data to tidy, adjust, or change the form  of the data? If so, why did you do this?

The waiting status shows an unusual distribution that contrasts with the overall data: the waiting reports decline over time, whereas the overall data increases. Another unusual distribution is found in the monthly issued data, which shows that most reports were issued in January. Another anomaly appears in the reports closed per day: it is interesting that no reports are closed on Saturday but some are closed on Sunday. This is interesting because reports should be closed by officials, who should not be working on Sunday. (However, I am still unsure; maybe a report deleted by the user is also counted as closed.)

Some actions were taken to adjust the data. For example, there are some negative values in the duration variables, which might be caused by null values in one of the features. Thus, in the visualisations and summaries I limit the data to values from 0 upwards. This was done to reduce confusion in presenting the data.
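The restriction to non-negative durations can be sketched as a simple filter. The values below are toy data, not taken from the dataset, and the name `valid` is hypothetical.

```r
# Toy durations in days; the negative value and the NA stand in for bad records.
idle_time.day <- c(12.6, -3.2, 73.2, NA, 317.9)

# Keep only non-missing, non-negative durations before summarising.
valid <- idle_time.day[!is.na(idle_time.day) & idle_time.day >= 0]
print(summary(valid))
```

In a ggplot2 chart, setting axis limits that start at 0 has a similar effect, and the rows dropped that way are reported in "Removed ... rows" warnings.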

Bivariate Plots Section

Here I want to get insights from the factor variables. There are three factor variables that I suppose may be meaningful (reporter, category, related_department). However, these variables have a large number of levels. Thus, I need to subset them to the most common ones (50 or 100 levels) for the sake of readability. To do that, I create variables holding the counts of each factor’s levels and sort them in descending order.
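One way to subset a factor to its most common levels is sketched below in base R. The toy data, the data frame name `reports`, and the cut-off `top_n` are hypothetical; with the real data the cut-off would be 50 or 100 as mentioned above.

```r
# Toy factor standing in for a high-cardinality column such as category.
reports <- data.frame(category = factor(c("A", "B", "A", "C", "A", "B", "D")))

top_n <- 2

# Count each level and sort the counts in descending order.
level_counts <- sort(table(reports$category), decreasing = TRUE)
top_levels <- names(level_counts)[seq_len(top_n)]

# Keep only rows in the most common levels and drop the unused levels.
top_reports <- droplevels(
  reports[reports$category %in% top_levels, , drop = FALSE]
)
print(level_counts)
```

Calling `droplevels()` matters for plotting, since unused factor levels would otherwise still appear as empty groups.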

First, I want to see how idle time is distributed across these variables.

I guess this figure shows the anomaly in the daily waiting time. In detail, some provinces and cities have a similar pattern (high completion around 0-100 and 400-500). Still, I wonder whether any other variables produce the same pattern.

Regarding the categories, it is interesting that some categories show high variability in the time needed to finish them, for example Jamkesmas, Raskin, ESDM, and BLSM (the plot should be enlarged to see all categories). However, they do not produce the same pattern as the earlier plot. Let's see the other variables. There is something special about user “6281680xxxx”: most of his/her reports took a long time to finish (I think this follows the same pattern as the daily waiting time anomaly). Let's see what it is.

Apparently, two categories tend to take a long time: Topik Lainnya (Other Topics) and Lingkungan hidup dan penanggulangan bencana (Environmental and Disaster Countermeasures). Both are among the top 10 report categories. I think this is because the report content is unclear, so it takes time to determine which department has to deal with it, or because resolving a disaster is expected to take a long time.

I want to continue with the main variables first before I dive too deep into this user.

It seems there is no pattern similar to the previous charts. However, I want to see the relation between waiting time and when the reports were issued.

## Warning: Removed 25484 rows containing missing values (geom_point).

It is interesting that only a small number of reports are finished on weekends, although some of those might be urgent reports. In most cases, reports are finished within one day (see Monday - Thursday), while reports issued on Friday are either finished the same day or on Monday after the weekend. On Saturday and Sunday there are very few reports, so no common duration stands out. Next, I want to see the relations between numerical variables such as report_closed, report_issued, and idle_time.

Correlation

## [1] 0.934973
## [1] -0.2976611
## [1] 0.01581374
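Pearson correlations like the ones printed above can be computed with `cor()`. The sketch below uses only the first five rows from the str() output, so its values will not match the full-data results and are shown purely to illustrate the call. The name `duration.day` is hypothetical.

```r
# First five rows from the str() output (Unix epoch seconds).
report_issued <- c(1387261022, 1423534297, 1445360400, 1447299913, 1350285493)
report_closed <- c(1415815214, 1431021603, 1447351205, 1448474403, 1354557601)

# Duration from issue to close, in days.
duration.day <- (report_closed - report_issued) / 86400

# Pearson correlation is the default method of cor().
r_issued_closed   <- cor(report_issued, report_closed)
r_issued_duration <- cor(report_issued, duration.day)
print(c(issued_closed = r_issued_closed, issued_duration = r_issued_duration))
```

If any of the columns contain missing values, `use = "complete.obs"` restricts the calculation to complete pairs.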

Scatterplots

I will try to use a log10 transformation on idle_time.day to normalise the data.

## Warning in eval(e, x, parent.frame()): NaNs produced
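A "NaNs produced" warning like this is what `log10()` emits for negative input, so it is probably caused by the negative durations mentioned earlier. The sketch below uses toy values to show how negative and zero durations behave under the transformation.

```r
# Toy durations in days, including one negative and one zero value.
idle_time.day <- c(-2.5, 0, 0.5, 10, 100)

# log10 of a negative number is NaN (with a warning); log10(0) is -Inf.
log_idle <- suppressWarnings(log10(idle_time.day))
print(log_idle)

# Only finite values can be placed on a log-scaled axis.
print(is.finite(log_idle))
```

Filtering to `idle_time.day > 0` before transforming avoids both the warning and the silently dropped points.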

It seems report_issued and report_closed are highly correlated, which may mean that reports are finished in a relatively short time. This is also confirmed by the other plot, which shows a very small (almost no) correlation between report_issued and idle_time. Furthermore, the plot shows that the waiting time is relatively short. However, there are some cases where the finish times are clustered around ~500 days. I want to see exactly when those reports were issued.

## Warning in eval(e, x, parent.frame()): NaNs produced
## Warning: Removed 77046 rows containing missing values (geom_point).

## [1] 1376629673
## [1] 1374816478
## [1] 1374223489

Above are, respectively, the mean, median, and mode of the timestamps in the previous plot. When I converted the timestamps into local time (GMT+7) and did some research on Google, I found that natural disasters occurred during these months (July-Aug 2013). This is consistent with the earlier plots that show an anomaly around ~500 days of idle time. I wonder why these reports took so long (almost 2 years) to solve. Either they were not updated until they were closed automatically, or the issues truly took almost 2 years to resolve.

Ref:
- https://www.epochconverter.com/timezones?q=1376629673&tz=Asia%2FJakarta
- https://reliefweb.int/report/indonesia/indonesia-humanitarian-snapshot-july-august-2013
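The timestamp conversion can also be done in R instead of an online converter. In this sketch the three values are the ones printed above, `stat_mode` is a hypothetical helper (base R has no built-in statistical mode), and "Asia/Jakarta" is assumed as the GMT+7 zone.

```r
# Hypothetical helper: most frequent value of a numeric vector.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Mean, median, and mode timestamps printed above (Unix epoch seconds).
stamps <- c(mean = 1376629673, median = 1374816478, mode = 1374223489)

# Convert to local time; "Asia/Jakarta" is an assumed zone for GMT+7.
local_time <- as.POSIXct(stamps, origin = "1970-01-01", tz = "Asia/Jakarta")
issue_month <- format(local_time, "%Y-%m")
print(issue_month)
```

On the full data, the helper would be applied to the report_issued values of the clustered reports, for example `stat_mode(clustered$report_issued)`, where `clustered` is a hypothetical subset.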

Bivariate Analysis

Talk about some of the relationships you observed in this part of the  investigation. How did the feature(s) of interest vary with other features in  the dataset?

In this section, I mostly explored the relation of idle_time (waiting time) with the other attributes. It is interesting that some of the other attributes appear to have an impact on, or show a pattern with, the waiting time.

Did you observe any interesting relationships between the other features  (not the main feature(s) of interest)? What was the strongest relationship you found?

As mentioned before, I mostly explored the relation between idle_time and the other features. However, I also looked at the relationship between report_issued and report_closed, and found that most reports are finished in a relatively short time.

Multivariate Plots Section

In this section I want to visualise the data on a map to gain a better geographical understanding. But first, I want to look in depth at user 6281680xxxx to explore the issue more comprehensively.

It seems the reports were mostly issued in 2013-2014, and there is very little data for 2012. On the other hand, the 2015 reports were commonly finished sooner. Next, I want to see where this reporter commonly issues reports.

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=-6.235983,106.765273&zoom=10&size=640x640&scale=1&maptype=terrain&sensor=false

The reports seem to come only from around Jakarta. However, the longer idle times seem clustered at a certain point, which may indicate that he/she reported the same issue over time. That satisfies my curiosity about this user, so I move on to the general distribution of the data.

## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Indonesia&sensor=false
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=-0.789275,113.921327&zoom=4&size=640x640&scale=1&maptype=terrain&sensor=false
## Warning: Removed 1 rows containing missing values (geom_rect).
## Warning: Removed 5 rows containing missing values (geom_point).

As seen in the graph, the fastest-responding area is in East Java and the slowest ones are in Celebes, Borneo, and North Sumatera. Next is the status mapping.

## Warning: Removed 1 rows containing missing values (geom_rect).
## Warning: Removed 5 rows containing missing values (geom_point).

The chart shows that most reports are done. However, in some areas, such as the islands of Sumatera and Halmahera, the proportions of ‘Processing’ and ‘Done’ are balanced.
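The per-region comparison of status proportions can be made numeric with a row-wise proportion table. This is a sketch on toy data only: the `island` column is hypothetical (the dataset has `area`, latitude, and longitude, but no island column), and the counts are invented for illustration.

```r
# Toy data; `island` is a hypothetical grouping column, not in the dataset.
reports <- data.frame(
  island = c("Java", "Java", "Java", "Java", "Sumatera", "Sumatera"),
  status = c("Selesai", "Selesai", "Selesai", "Proses", "Selesai", "Proses")
)

# Cross-tabulate, then convert each row (island) to proportions.
status_share <- prop.table(table(reports$island, reports$status), margin = 1)
print(round(status_share, 2))
```

With `margin = 1` each row sums to 1, so regions with very different report volumes can be compared on the same scale.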

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Yes, especially for user 6281680xxxx: the map shows where the reports that took too long to finish originated.

Were there any interesting or surprising interactions between features?

The data are scattered regionally. For example, in the summary the proportion of finished reports is far higher than that of the other statuses; however, this seems to be only because the data are highly clustered on the island of Java, while on the other islands the proportions are not so distinct.


Final Plots and Summary

Plot One

## Warning: Removed 244 rows containing non-finite values (stat_bin).

Description One

In this chart, it is interesting that some reports took a long time to finish (> 300 days).

Plot Two

## Warning in eval(e, x, parent.frame()): NaNs produced
## Warning: Removed 77046 rows containing missing values (geom_point).

Description Two

Eventually, I found one user whose reports have a similar pattern to that anomaly. It shows some reports clustered at a duration of 400-500 days. Thus, I was curious about when these reports were issued, and found that they were issued in July-Aug 2013.

Plot Three

## Warning: Removed 1 rows containing missing values (geom_rect).
## Warning: Removed 5 rows containing missing values (geom_point).

Description Three

Finally, I found an anomaly in the status of the reports: the proportions of ‘Done’ and ‘In Progress’ differ among islands. The data are clustered on Java, where most reports are ‘Done’. However, on other islands such as Sumatera, Borneo, and Halmahera the difference is indistinct.


Reflection

I collected my data from a data repository. Originally, I had no knowledge about the data used in this analysis and no particular reason for choosing it; I simply saw it as available data that complied with the requirements. Furthermore, I thought I could extract only a little information from the data because I did not know the relations between the features.

However, it took a long time to be able to work with this dataset. There were bad values here and there, which made it take longer than a prepared dataset would have. It made me understand that data wrangling and cleaning take most of the time in an analysis. In addition, I had to contemplate for a long time to understand the connections between features and which combinations might convey meaning and value.

Thus, I came up with several different ways to exploit the data. It was surprising that I could find some unique findings (such as the waiting time duration anomaly), because at first I was pessimistic that the data would show anything of significant value. For the future, this makes me think that I should gain a better understanding of the data I am going to dive into, which I think will save contemplation time. Eventually, I realised all of this is only part of a greater exploration.